feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out #20

Merged
uqio merged 1 commit into main from feat/ship8-rgba-nv24-nv42
Apr 26, 2026
@uqio uqio commented Apr 26, 2026

Tranche 4b of Ship 8 sink-side RGBA. Adds Nv24 / Nv42 (semi-planar 4:4:4) RGBA output via the dual-const-generic `<SWAP_UV, ALPHA>` template established by PR #17 (NV12 / NV21). Also retro-applies a Strategy A combined RGB→RGBA fan-out to all 8 wired families, so callers attaching both `with_rgb` and `with_rgba` no longer pay the per-pixel YUV→RGB math twice. This addresses the Copilot review finding from PR #19 (`src/sinker/mixed.rs:1648`).

Scope

| # | Tranche | Formats | Status |
| --- | --- | --- | --- |
| 1 | 4:2:0 planar | Yuv420p | ✅ shipped (PR #16) |
| 2 | 4:2:0 semi-planar | Nv12, Nv21 | ✅ shipped (PR #17) |
| 3 | 4:2:2 planar + semi-planar | Yuv422p, Nv16 | ✅ shipped (PR #18) |
| 4a | 4:4:4 planar | Yuv444p | ✅ shipped (PR #19) |
| 4b | 4:4:4 semi-planar | Nv24, Nv42 | this PR + Strategy A retro-applied to all 8 wired families |
| 4c | 4:4:0 planar | Yuv440p | next; wiring-only (reuses `yuv_444_to_rgba_row`) |
| 5 | High-bit-depth 4:2:0 | Yuv420p9/10/12/14/16, P010/P012/P016 | |
| 6 | High-bit-depth 4:2:2 | Yuv422p9/10/12/14/16, Yuv440p10/12, P210/P212/P216 | |
| 7 | High-bit-depth 4:4:4 | Yuv444p9/10/12/14/16, P410/P412/P416 | |

Usage:

```rust
use colconv::{
    frame::Nv24Frame,
    sinker::MixedSinker,
    yuv::{Nv24, nv24_to},
    ColorMatrix,
};

let frame = Nv24Frame::new(&y_plane, &uv_plane, w, h, w, 2 * w);
let mut rgb = vec![0u8; (w * h * 3) as usize];
let mut rgba = vec![0u8; (w * h * 4) as usize];
let mut sinker = MixedSinker::<Nv24>::new(w as usize, h as usize)
    .with_rgb(&mut rgb)?
    .with_rgba(&mut rgba)?;

// Both buffers requested → YUV→RGB math runs once, RGBA derived via the
// Strategy A fan-out (no double per-pixel cost).
nv24_to(&frame, /* full_range = */ true, ColorMatrix::Bt709, &mut sinker)?;
```

What's in this PR

Public API

  • `MixedSinker::<Nv24>::with_rgba(&mut [u8])` / `set_rgba` and `MixedSinker::<Nv42>::with_rgba` / `set_rgba`, declared on format-specific impl blocks.
  • `row::nv24_to_rgba_row(...)` and `row::nv42_to_rgba_row(...)` — public dispatchers paralleling the RGB variants.

Kernel work — NV24/NV42 RGBA

Mirrors PR #17 (NV12/NV21) shape: dual const generic `<const SWAP_UV: bool, const ALPHA: bool>` on a single shared `nv24_or_nv42_to_rgb_or_rgba_row_impl` kernel per backend, with 4 thin wrappers (NV24 RGB / NV42 RGB / NV24 RGBA / NV42 RGBA) forwarding the 4 `(SWAP_UV, ALPHA)` combinations. The compiler monomorphizes into 4 separate functions; the `if ALPHA` branch and the unused alpha-vector splat are DCE'd at each call site.
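As a rough illustration of the dual-const-generic shape described above, the sketch below uses one shared body and four thin wrappers. Everything in it is an assumption made for illustration: the function names carry a `_sketch` suffix, the signatures are simplified, and the colour math is a generic full-range BT.601-style fixed-point approximation, not the crate's actual kernel.

```rust
/// Clamp an intermediate value into the 0..=255 byte range.
fn clamp_u8(v: i32) -> u8 {
    v.clamp(0, 255) as u8
}

/// Hypothetical shared kernel: `SWAP_UV` selects the chroma byte order,
/// `ALPHA` selects 3 or 4 output bytes per pixel. Both are compile-time
/// constants, so each wrapper below gets its own specialised copy.
#[inline(always)]
fn nv24_or_nv42_row_sketch<const SWAP_UV: bool, const ALPHA: bool>(
    y: &[u8],
    uv_or_vu: &[u8],
    out: &mut [u8],
    width: usize,
) {
    let bpp: usize = if ALPHA { 4 } else { 3 };
    assert!(y.len() >= width, "luma row too short");
    assert!(uv_or_vu.len() >= 2 * width, "chroma row too short");
    assert!(out.len() >= width * bpp, "out row too short");

    for x in 0..width {
        // 4:4:4 semi-planar: one interleaved chroma pair per pixel.
        let (c0, c1) = (uv_or_vu[2 * x], uv_or_vu[2 * x + 1]);
        let (u, v) = if SWAP_UV { (c1, c0) } else { (c0, c1) };
        let (luma, d, e) = (y[x] as i32, u as i32 - 128, v as i32 - 128);

        // Placeholder colour math (assumed coefficients, 8-bit fixed point).
        let px = &mut out[x * bpp..(x + 1) * bpp];
        px[0] = clamp_u8(luma + ((359 * e) >> 8));
        px[1] = clamp_u8(luma - ((88 * d + 183 * e) >> 8));
        px[2] = clamp_u8(luma + ((454 * d) >> 8));
        if ALPHA {
            px[3] = 0xFF; // opaque alpha; this branch vanishes when ALPHA = false
        }
    }
}

// Four thin wrappers, one per (SWAP_UV, ALPHA) combination.
pub fn nv24_to_rgb_row_sketch(y: &[u8], uv: &[u8], out: &mut [u8], width: usize) {
    nv24_or_nv42_row_sketch::<false, false>(y, uv, out, width)
}
pub fn nv42_to_rgb_row_sketch(y: &[u8], vu: &[u8], out: &mut [u8], width: usize) {
    nv24_or_nv42_row_sketch::<true, false>(y, vu, out, width)
}
pub fn nv24_to_rgba_row_sketch(y: &[u8], uv: &[u8], out: &mut [u8], width: usize) {
    nv24_or_nv42_row_sketch::<false, true>(y, uv, out, width)
}
pub fn nv42_to_rgba_row_sketch(y: &[u8], vu: &[u8], out: &mut [u8], width: usize) {
    nv24_or_nv42_row_sketch::<true, true>(y, vu, out, width)
}
```

Because both parameters are constants, the per-pixel `if SWAP_UV` and `if ALPHA` checks cost nothing at run time in this sketch; the trade-off is four copies of the body in the binary.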

| File | What's added |
| --- | --- |
| `row/scalar.rs` | NV24/NV42 RGBA + `<SWAP_UV, ALPHA>` template + `expand_rgb_to_rgba_row` helper |
| `arch/neon.rs` | NV24/NV42 RGBA; uses `vst4q_u8` when `ALPHA = true`, `vst3q_u8` otherwise |
| `arch/x86_sse41.rs` | NV24/NV42 RGBA; reuses `write_rgba_16` from PR #16 |
| `arch/x86_avx2.rs` | NV24/NV42 RGBA; reuses `write_rgba_32` from PR #16 |
| `arch/x86_avx512.rs` | NV24/NV42 RGBA; reuses `write_rgba_64` from PR #16 |
| `arch/wasm_simd128.rs` | NV24/NV42 RGBA; reuses wasm `write_rgba_16` from PR #16 |

Strategy A — combined RGB→RGBA fan-out

Tranches 1–4a wired RGB and RGBA as independent kernel calls — when a caller attached both `with_rgb` and `with_rgba`, `MixedSinker::process` ran the YUV→RGB per-pixel math twice. Copilot review of PR #19 (`src/sinker/mixed.rs:1648`) flagged this.

This PR addresses it by:

  1. Adding `pub(crate) fn expand_rgb_to_rgba_row(rgb, rgba_out, width)` in `row/scalar.rs` — memory-bound copy + `0xFF` alpha pad.
  2. Reworking each `MixedSinker::process` across all 8 wired families (Yuv420p, Yuv422p, Yuv444p, Nv12, Nv16, Nv21, Nv24, Nv42). Output mode resolution per row:
    • RGBA-only (no RGB / HSV): dedicated `*_to_rgba_row` kernel directly into the output buffer.
    • RGB / HSV (± RGBA): RGB kernel once into `rgb_row` (or `rgb_scratch`), then HSV derivation if requested, then `expand_rgb_to_rgba_row` if RGBA also requested.

Effective memory traffic for the both-buffers case: 3W RGB write + 3W L1-hot read + 4W RGBA write. Discounting the L1-hot read, that is ≈ 7W of traffic to memory, the same as a hypothetical combined kernel ("Strategy B"), at ~1/10th the LOC.
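The per-row mode resolution above can be sketched roughly as follows. This is a simplified illustration, not the code in `mixed.rs`: `process_row_sketch` and the two `luma_to_*` kernels are hypothetical stand-ins (they write gray from luma instead of doing real YUV→RGB math), HSV derivation is omitted, and only the body of `expand_rgb_to_rgba_row` follows the loop quoted in the review thread.

```rust
/// Copy `width` RGB pixels into RGBA, padding alpha with 0xFF.
fn expand_rgb_to_rgba_row(rgb: &[u8], rgba_out: &mut [u8], width: usize) {
    for (src, dst) in rgb[..width * 3]
        .chunks_exact(3)
        .zip(rgba_out[..width * 4].chunks_exact_mut(4))
    {
        dst[..3].copy_from_slice(src);
        dst[3] = 0xFF;
    }
}

/// Stand-in for the expensive YUV→RGB kernel (assumption: gray from luma).
fn luma_to_rgb_row(y: &[u8], rgb_out: &mut [u8], width: usize) {
    for (px, &l) in rgb_out[..width * 3].chunks_exact_mut(3).zip(&y[..width]) {
        px.fill(l);
    }
}

/// Stand-in for the dedicated RGBA kernel (assumption: gray from luma).
fn luma_to_rgba_row(y: &[u8], rgba_out: &mut [u8], width: usize) {
    for (px, &l) in rgba_out[..width * 4].chunks_exact_mut(4).zip(&y[..width]) {
        px[..3].fill(l);
        px[3] = 0xFF;
    }
}

/// Hypothetical per-row output selection in the spirit of Strategy A.
pub fn process_row_sketch(
    y: &[u8],
    rgb: Option<&mut [u8]>,
    rgba: Option<&mut [u8]>,
    width: usize,
) {
    match (rgb, rgba) {
        // RGBA-only: dedicated kernel straight into the output buffer.
        (None, Some(rgba_row)) => luma_to_rgba_row(y, rgba_row, width),
        // RGB requested: run the colour math once, then fan out to RGBA
        // with a cheap copy if an RGBA buffer is attached as well.
        (Some(rgb_row), rgba) => {
            luma_to_rgb_row(y, rgb_row, width);
            if let Some(rgba_row) = rgba {
                expand_rgb_to_rgba_row(rgb_row, rgba_row, width);
            }
        }
        (None, None) => {}
    }
}
```

In this sketch the both-buffers path calls the colour kernel exactly once per row; the RGBA bytes are derived from the freshly written RGB row, which is the property the umbrella test in this PR asserts byte-for-byte.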

Strategy B (a third const generic on every kernel doing both stores per pixel) is documented as a future follow-up in `docs/color-conversion-functions.md` § Ship 8 — only worth the LOC cost if profiling later shows the L1-readback step matters.

MixedSinker integration

`with_rgba` / `set_rgba` declared on format-specific impl blocks (per PR #16 safety pattern) — attaching RGBA to a sink that doesn't write it is a compile error rather than a silent stale-buffer bug. The `compile_fail` doctest negative example moved forward from `Nv24` to `Yuv440p` (next not-yet-wired format).
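A minimal sketch of the format-specific impl block pattern, assuming a sink that is generic over a zero-sized format marker. `Sink`, `SinkError`, and `has_rgba` are hypothetical names for illustration, and the error variants are simplified to unit variants; the crate's real types carry more detail.

```rust
use std::marker::PhantomData;

/// Format markers (zero-sized). `Yuv440p` stands for a not-yet-wired format.
pub struct Nv24;
pub struct Yuv440p;

/// Simplified error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum SinkError {
    GeometryOverflow,
    RgbaBufferTooShort,
}

/// Hypothetical sink, generic over the source format marker.
pub struct Sink<'a, F> {
    width: usize,
    height: usize,
    rgba: Option<&'a mut [u8]>,
    _format: PhantomData<F>,
}

// Available for every format.
impl<'a, F> Sink<'a, F> {
    pub fn new(width: usize, height: usize) -> Self {
        Sink { width, height, rgba: None, _format: PhantomData }
    }

    pub fn has_rgba(&self) -> bool {
        self.rgba.is_some()
    }
}

// Only declared for formats whose process path actually writes RGBA.
impl<'a> Sink<'a, Nv24> {
    pub fn with_rgba(mut self, buf: &'a mut [u8]) -> Result<Self, SinkError> {
        // width * height * 4 may not fit in usize (e.g. on 32-bit targets).
        let need = self
            .width
            .checked_mul(self.height)
            .and_then(|n| n.checked_mul(4))
            .ok_or(SinkError::GeometryOverflow)?;
        if buf.len() < need {
            return Err(SinkError::RgbaBufferTooShort);
        }
        self.rgba = Some(buf);
        Ok(self)
    }
}

// `Sink::<Yuv440p>::new(2, 2).with_rgba(&mut buf)` does not compile here:
// no `with_rgba` method exists for that instantiation.
```

The design choice is that an unsupported combination is rejected by the type checker instead of producing a buffer that is silently never written.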

Doc updates

  • `docs/color-conversion-functions.md` § Ship 8 — tranche tracker updated (4a ✅ shipped, 4b ⏳ this PR, 4c next), new "Combined RGB + RGBA path: Strategy A (shipped) + Strategy B (deferred)" subsection enumerating the tradeoff space.
  • `docs/color-conversion-functions.md` § 2a (new) — "Real-world asset library format frequency" table calibrating § 2's priority tiers against post-production MAM / streaming / VFX / live-broadcast archetypes. Adds rows for AV1, AVC-Intra / Canon XF-AVC, camera RAW family, stills, and an "Other / unaccounted" residual; per-row notes flag the bimodal cases (DNxHR, ProRes 422). Tranche 6 (10-bit 4:2:2 RGBA) ranks as the single biggest unlock at 30–55% combined for post-production MAM workloads.

Tests

+16 lib tests on aarch64 (475 vs. 459 in PR #19); per-backend tests on the other 4 SIMD backends fire on their matching CI runners.

| Layer | Tests added |
| --- | --- |
| Scalar `expand_rgb_to_rgba_row` | 3: alpha-pad / RGB-preserve, only-first-N-pixels invariant, zero-width no-op |
| Format-level Nv24 RGBA | 4: gray-to-gray + opaque alpha, RGB-byte invariant, buffer-too-short, random-YUV SIMD parity (1922×4 frame, all 4 matrices × both ranges) |
| Format-level Nv42 RGBA | 4: same shape |
| Cross-format Strategy A umbrella | 1: `strategy_a_rgb_and_rgba_byte_identical_for_all_wired_families` exercises all 8 `process` impls and asserts `rgba[i*4..i*4+3] == rgb[i*3..i*3+3]` with `rgba[i*4+3] == 0xFF` per pixel |
| NEON per-backend (verified locally) | 4: 16-pixel all-matrices + varied widths (1, 3, 15, 17, 32, 33, 1920, 1921; odd widths validate the 4:4:4 no-parity contract) × NV24 / NV42 |
| SSE4.1 per-backend (CI) | 4: same shape |
| AVX2 per-backend (CI) | 4: 32-pixel main loop + tail widths × NV24 / NV42 |
| AVX-512 per-backend (CI) | 4: 64-pixel main loop + tail widths × NV24 / NV42 |
| wasm simd128 per-backend (CI) | 4: 16-pixel + tail widths × NV24 / NV42 |

Per-backend tests bypass the dispatcher (call each backend's `unsafe nv24_to_rgba_row` / `nv42_to_rgba_row` directly under runtime feature detection) so on AVX-512-capable CI runners all three x86 paths run.
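The runtime detection that gates such direct calls might look roughly like the sketch below. `detected_x86_tiers` is a hypothetical helper, and the choice of `avx512bw` as the AVX-512 gate is an assumption; the crate may key on a different feature set.

```rust
/// Lists the x86 SIMD tiers the current CPU reports at run time.
/// On non-x86 targets the list is empty.
#[allow(unused_mut)]
pub fn detected_x86_tiers() -> Vec<&'static str> {
    let mut tiers = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Each check is independent, so a CPU with AVX-512 normally
        // reports the lower tiers too and a test can exercise all of them.
        if std::is_x86_feature_detected!("sse4.1") {
            tiers.push("sse4.1");
        }
        if std::is_x86_feature_detected!("avx2") {
            tiers.push("avx2");
        }
        if std::is_x86_feature_detected!("avx512bw") {
            tiers.push("avx512bw");
        }
    }
    tiers
}
```

A per-backend test would then skip itself when its tier is absent and otherwise call that backend's kernel directly, comparing the result against the scalar path.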

Local results (aarch64 macOS): 475 lib tests + 1 doctest pass; wasm32 + x86_64 cross-targets compile clean.

What's deferred

  • Tranche 4c — `Yuv440p` — wiring-only PR, reuses `yuv_444_to_rgba_row`.
  • Tranches 5–7 — high-bit-depth families.
  • `with_rgba_u16` ships in tranches 5–7.
  • YUVA source frames (Ship 8b) — independent follow-up.
  • Strategy B (combined kernel writing both stores per pixel) — future optimization, only if profiling shows the L1-readback step matters.
  • Cleanup PR after merge: split inline `mod tests` blocks out of large source files (`mixed.rs`, per-arch backends, `scalar.rs`). It also covers visibility tightening on `_impl` functions (Copilot finding #2; kept `pub(crate)` here for consistency with NV12/NV21's existing pattern, and should land as a sweep across all `_impl`s) and RGBA-plane bounds-check helper extraction across all 8 `process` impls (Copilot finding #4).

Test plan

  • CI green on `test`, `test-sde-avx512`, `cross`, `coverage`, `clippy`, `build`, `miri-*` jobs.
  • Per-tier coverage matrix exercises SSE4.1 / AVX2 / scalar paths via existing `colconv_disable_*` rustflags.
  • Verify Nv24 / Nv42 → both-buffers (RGB + RGBA) pipeline end-to-end with a real frame (gray + non-gray patches).
  • `cargo doc --lib --no-deps` clean (no new doc warnings vs. main).

🤖 Generated with Claude Code

@al8n al8n requested a review from Copilot April 26, 2026 06:36
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds RGBA output support for NV24/NV42 and introduces a “Strategy A” optimization to avoid running YUV→RGB math twice when both RGB and RGBA outputs are attached.

Changes:

  • Add NV24/NV42 RGBA row converters (scalar + SIMD dispatch) and wire them into MixedSinker.
  • Implement Strategy A fan-out (RGB -> RGBA expansion) to reuse the RGB kernel output when RGBA is also requested.
  • Add extensive tests covering NV24/NV42 RGBA behavior and SIMD-vs-scalar equivalence.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/sinker/mixed.rs | Wires NV24/NV42 RGBA buffers, adds Strategy A selection logic, and adds sinker-level tests. |
| src/row/scalar.rs | Adds NV24/NV42 scalar RGBA kernels and the expand_rgb_to_rgba_row helper + tests. |
| src/row/mod.rs | Exposes NV24/NV42 RGBA row APIs with SIMD dispatch. |
| src/row/arch/x86_sse41.rs | Adds SSE4.1 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/x86_avx2.rs | Adds AVX2 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/x86_avx512.rs | Adds AVX-512 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/wasm_simd128.rs | Adds wasm simd128 NV24/NV42 RGBA kernels and equivalence tests. |
| src/row/arch/neon.rs | Adds NEON NV24/NV42 RGBA kernels and equivalence tests. |

Comment thread src/sinker/mixed.rs
///
/// See [`MixedSinker::<Nv24>::with_rgba`] for the same rationale and
/// constraints; Nv42 differs only in chroma byte order (V before U).
/// Returns `Err(RgbaBufferTooShort)` if `buf.len() < width × height × 4`.
Copilot AI Apr 26, 2026

The Nv42::with_rgba doc currently mentions only RgbaBufferTooShort, but the implementation calls self.frame_bytes(4)?, which can also return GeometryOverflow (e.g., on 32-bit targets). Update the doc comment to match the actual error surface (similar to the Nv24 docs above).

Suggested change
- /// Returns `Err(RgbaBufferTooShort)` if `buf.len() < width × height × 4`.
+ /// Returns `Err(GeometryOverflow)` if `width × height × 4` cannot be
+ /// represented on the current target, or `Err(RgbaBufferTooShort)` if
+ /// `buf.len() < width × height × 4`.

Comment thread src/row/scalar.rs
/// `out.len() >= width * (if ALPHA { 4 } else { 3 })`.
#[cfg_attr(not(tarpaulin), inline(always))]
- fn nv24_or_nv42_to_rgb_row_impl<const SWAP_UV: bool>(
+ pub(crate) fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>(
Copilot AI Apr 26, 2026

This shared kernel function appears to be an internal implementation detail (and was previously private). If it’s not used outside row::scalar, consider making it fn (module-private) again to avoid unnecessarily widening the crate-internal API surface and reduce coupling.

Suggested change
- pub(crate) fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>(
+ fn nv24_or_nv42_to_rgb_or_rgba_row_impl<const SWAP_UV: bool, const ALPHA: bool>(

Comment thread src/row/scalar.rs
Comment on lines +516 to +520
for x in 0..width {
rgba_out[x * 4] = rgb[x * 3];
rgba_out[x * 4 + 1] = rgb[x * 3 + 1];
rgba_out[x * 4 + 2] = rgb[x * 3 + 2];
rgba_out[x * 4 + 3] = 0xFF;
Copilot AI Apr 26, 2026

expand_rgb_to_rgba_row is likely on a hot path. Consider iterating via chunks_exact(3) zipped with chunks_exact_mut(4) (or a similar approach) to avoid repeated index multiplications and help the compiler generate a tighter loop.

Suggested change
- for x in 0..width {
-   rgba_out[x * 4] = rgb[x * 3];
-   rgba_out[x * 4 + 1] = rgb[x * 3 + 1];
-   rgba_out[x * 4 + 2] = rgb[x * 3 + 2];
-   rgba_out[x * 4 + 3] = 0xFF;
+ for (rgb_px, rgba_px) in rgb[..width * 3]
+   .chunks_exact(3)
+   .zip(rgba_out[..width * 4].chunks_exact_mut(4))
+ {
+   rgba_px[0] = rgb_px[0];
+   rgba_px[1] = rgb_px[1];
+   rgba_px[2] = rgb_px[2];
+   rgba_px[3] = 0xFF;

Comment thread src/sinker/mixed.rs
Comment on lines +1336 to +1347
if let Some(buf) = rgba.as_deref_mut() {
let rgba_plane_end =
one_plane_end
.checked_mul(4)
.ok_or(MixedSinkerError::GeometryOverflow {
width: w,
height: h,
channels: 4,
})?;
let rgba_plane_start = one_plane_start * 4;
expand_rgb_to_rgba_row(rgb_row, &mut buf[rgba_plane_start..rgba_plane_end], w);
}
Copilot AI Apr 26, 2026

The RGBA-plane bounds/overflow check + slice computation is duplicated across multiple PixelSink impls (Yuv420p/Yuv422p/Yuv444p/Nv12/Nv16/Nv21/Nv24/Nv42). Consider extracting a small helper (e.g., a method that returns the mutable RGBA row slice for [one_plane_start..one_plane_end]) to reduce repetition and the risk of future drift between formats.

codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 92.59259% with 22 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/row/arch/neon.rs | 75.00% | 14 Missing ⚠️ |
| src/row/mod.rs | 71.42% | 8 Missing ⚠️ |

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


Comment thread src/row/scalar.rs
debug_assert!(uv_or_vu.len() >= 2 * width, "chroma row too short");
- debug_assert!(rgb_out.len() >= width * 3, "rgb_out row too short");
+ let bpp: usize = if ALPHA { 4 } else { 3 };
+ debug_assert!(out.len() >= width * bpp, "out row too short for {bpp}bpp");
Copilot AI Apr 26, 2026

The debug_assert! message uses {bpp} inside a string literal, so the actual bpp value will not be interpolated. Consider using a formatted message (e.g., with {} + bpp) so debug builds report the concrete expected stride.

Suggested change
- debug_assert!(out.len() >= width * bpp, "out row too short for {bpp}bpp");
+ debug_assert!(out.len() >= width * bpp, "out row too short for {}bpp", bpp);

@al8n al8n changed the title from "update" to "feat(sinker): Ship 8 — Nv24/Nv42 RGBA + Strategy A RGB→RGBA fan-out" Apr 26, 2026
@uqio uqio merged commit 3dc020d into main Apr 26, 2026
15 of 75 checks passed
@uqio uqio deleted the feat/ship8-rgba-nv24-nv42 branch April 26, 2026 07:05
uqio added a commit that referenced this pull request Apr 26, 2026